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Abstract. We introduce supervised latent Dirichlet allocation (sLDA), a 
^ statistical model of labelled documents. The model accommodates a va- 

riety of response types. We derive an approximate maximum- likelihood 
procedure for parameter estimation, which relies on variational meth- 
^ 1 ods to handle intractable posterior expectations. Prediction problems 

^ motivate this research: we use the fitted model to predict response val- 

ues for new documents. We test sLDA on two real- world problems: 
^ movie ratings predicted from reviews, and the political tone of amend- 

ments in the U.S. Senate based on the amendment text. We illustrate 
the benefits of sLDA versus modern regularized regression, as well as 
I versus an unsupervised LDA analysis followed by a separate regression. 
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^ 1. Introduction 
cn 

^ There is a growing need to analyze collections of electronic text. We have unprecedented access to 

I large corpora, such as government documents and news archives, but we cannot take advantage 

^ of these collections without being able to organize, understand, and summarize their contents. We 

*^ need new statistical models to analyze such data, and fast algorithms to compute with those models. 

The complexity of document corpora has led to considerable interest in applying hierarchical sta- 
tistical models based on what are called topics (Blei and Lafferty, 2009; Blei et al., 2003; Griffiths 
et al., 2005; Erosheva et al., 2004). Formally, a topic is a probability distribution over terms in a 
vocabulary. Informally, a topic represents an underlying semantic theme; a document consisting of 
a large number of words might be concisely modelled as deriving from a smaller number of topics. 
Such topic models provide useful descriptive statistics for a collection, which facilitates tasks like 
browsing, searching, and assessing document similarity. 

Most topic models, such as latent Dirichlet allocation (LDA) (Blei et al., 2003), are unsupervised. 
Only the words in the documents are modelled, and the goal is to infer topics that maximize the 
likelihood (or the posterior probability) of the collection. This is compelling — with only the doc- 
uments as input, one can find patterns of words that reflect the broad themes that run through 
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them — and unsupervised topic modeling has many appHcations. Researchers have used the topic- 
based decomposition of the collection to examine interdisciplinarity (Ramage et al., 2009), organize 
large collections of digital books (Mimno and McCallum, 2007), recommend purchased items (Mar- 
lin, 2003), and retrieve relevant documents to a query (Wei and Croft, 2006). Researchers have also 
evaluated the interpretability of the topics directly, such as by correlation to a thesaurus (Steyvers 
and Griffiths, 2006) or by human study (Chang et al., 2009). 

In this work, we focus on document collections where each document is endowed with a response 
variable, external to its words, that we are interested in predicting. There are many examples: in 
a collection of movie reviews, each document is summarized by a numerical rating; in a collection 
of news articles, each document is assigned to a section of the newspaper; in a collection of on- 
line scientific articles, each document is downloaded a certain number of times. To analyze such 
collections, we develop supervised topic models, where the goal is to infer latent topics that are 
predictive of the response. With a fitted model in hand, we can infer the topic structure of an 
unlabeled document and then form a prediction of its response. 

Unsupervised LDA has previously been used to construct features for classification. The hope was 
that LDA topics would turn out to be useful for categorization, since they act to reduce data 
dimension (Blei et al., 2003; Fei-Fei and Perona, 2005; Quelhas et al., 2005). However, when the 
goal is prediction, fitting unsupervised topics may not be a good choice. Consider predicting a 
movie rating from the words in its review. Intuitively, good predictive topics will differentiate words 
like "excellent", "terrible", and "average," without regard to genre. But topics estimated from an 
unsupervised model may correspond to genres, if that is the dominant structure in the corpus. 

The distinction between unsupervised and supervised topic models is mirrored in existing dimension- 
reduction techniques. For example, consider regression on unsupervised principal components versus 
partial least squares and projection pursuit (Hastie et al., 2001) — both search for covariate linear 
combinations most predictive of a response variable. Linear supervised methods have nonparametric 
analogs, such as an approach based on kernel ICA (Fukumizu et al., 2004). In text analysis problems, 
McCallum et al. (2006) developed a joint topic model for words and categories, and Blei and Jordan 
(2003) developed an LDA model to predict caption words from images. In chemogenomic profiling, 
Flaherty et al. (2005) proposed "labelled LDA," which is also a joint topic model, but for genes and 
protein function categories. These models differ fundamentally from the model proposed here. 

This paper is organized as follows. We develop the supervised latent Dirichlet allocation model 
(sLDA) for document-response pairs. We derive parameter estimation and prediction algorithms for 
the general exponential family response distributions. We show specific algorithms for a Gaussian 
response and Poisson response, and suggest a general approach to any other exponential family. 
Finally, we demonstrate sLDA on two real- world problems. First, we predict movie ratings based on 
the text of the reviews. Second, we predict the political tone of a senate amendment, based on an 
ideal-point analysis of the roll call data (Clinton et al., 2004). In both settings, we find that sLDA 
provides more predictive power than regression on unsupervised LDA features. The sLDA approach 
also improves on the lasso (Tibshirani, 1996), a modern regularized regression technique. 

2. Supervised latent Dirichlet allocation 

Topic models are distributions over document collections where each document is represented as 
a collection of discrete random variables W^imj which are its words. In topic models, we treat the 
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words of a document as arising from a set of latent topics, that is, a set of unknown distributions 
over the vocabulary. Documents in a corpus share the same set of K topics, but each document uses 
a mix of topics — the topic proportions — unique to itself. Topic models are a relaxation of classical 
document mixture models, which associate each document with a single unknown topic. Thus they 
are mixed-membership models (Erosheva et al., 2004). See Steyvers and Griffiths (2006) and Blei 
and Lafferty (2009) for recent reviews. 

Here we build on latent Dirichlet allocation (LDA) (Blei et al., 2003), a topic model that serves 
as the basis for many others. In LDA, we treat the topic proportions for a document as a draw 
from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic 
assignment from those proportions, then drawing a word from the corresponding topic. 

In supervised latent Dirichlet allocation (sLDA), we add to LDA a response variable connected 
to each document. As mentioned, examples of this variable include the number of stars given to 
a movie, the number of times an on-line article was downloaded, or the category of a document. 
We jointly model the documents and the responses, in order to find latent topics that will best 
predict the response variables for future unlabeled documents. SLDA uses the same probabilistic 
machinery as a generalized linear model to accommodate various types of response: unconstrained 
real values, real values constrained to be positive (e.g., failure times), ordered or unordered class 
labels, nonnegative integers (e.g., count data), and other types. 

Fix for a moment the model parameters: K topics (each /3fc a vector of term probabilities), a 
Dirichlet parameter a, and response parameters rj and 6. (These response parameters are described 
in detail below.) Under the sLDA model, each document and response arises from the following 
generative process: 

1. Draw topic proportions | a ~ Dir(a). 

2. For each word 

(a) Draw topic assignment Zn\9 ^ Mult(0). 

(b) Draw word Wn \ z^, Pi-.K ~ Mult(/3^„). 

3. Draw response variable y \ z\-n, ?7, <^ ~ GLM(z, r/, 5), where we define 



Figure 1 illustrates the family of probability distributions corresponding to this generative process 
as a graphical model. 

The distribution of the response is a generalized linear model (McCullagh and Nelder, 1989), 



There are two main ingredients in a generalized linear model (GLM): the "random component" and 
the "systematic component." For the random component, we take the distribution of the response 
to be an exponential dispersion family with natural parameter rf^ z and dispersion parameter 5. 
For each fixed 5, Equation 2 is an exponential family, with base measure h{y,5), sufficient statistic 
y, and log-normalizer A^rj'^ z). The dispersion parameter provides additional flexibility in modeling 
the variance of y. Note that Equation 2 need not be an exponential family jointly in (rf^ z,S). 



(1) 



z := 



(1/A^)E. 
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In the systematic component of the GLM, we relate the exponential- family parameter of the random 
component to a linear combination of covariates — the so-called linear predictor. For sLDA, we have 
already introduced the linear predictor: it is rj'^ z. The reader familiar with GLMs will recognize 
that our choice of systematic component means sLDA uses only canonical link functions. In future 
work, we will relax this constraint. 

The GLM framework gives us the flexibility to model any type of response variable whose distribu- 
tion can be written in exponential dispersion form Equation 2. This includes many commonly used 
distributions: the normal (for real response); the binomial (for binary response); the multinomial 
(for categorical response); the Poisson and negative binomial (for count data); the gamma, Weibull, 
and inverse Gaussian (for failure time data); and others. Each of these distributions corresponds 
to a particular choice of h{y,6) and A{r]'^z). For example, the normal distribution corresponds to 
h{y,6) = (l/\/27r5) exp{— and A{r]^ z) = {rf^z)'^/2. In this case, the usual Gaussian parame- 
ters, mean fi and variance o"^, are equal to r]'^ z and 6, respectively. 

What distinguishes sLDA from the usual GLM is that the covariates are the unobserved empirical 
frequencies of the topics in the document. In the generative process, these latent variables are 
responsible for producing the words of the document, and thus the response and the words are tied. 
The regression coefficients on those frequencies constitute r/. Note that a GLM usually includes an 
intercept term, which amounts to adding a covariate that always equals one. Here, such a term is 
redundant, because the components of z always sum to one (see Equation 1). 

By regressing the response on the empirical topic frequencies, we treat the response as non-exchangeable 
with the words. The document (i.e., words and their topic assignments) is generated first, under full 
word exchangeability; then, based on the document, the response variable is generated. In contrast, 
one could formulate a model in which y is regressed on the topic proportions 6. This treats the 
response and all the words as jointly exchangeable. But as a practical matter, our chosen formula- 
tion seems more sensible: the response depends on the topic frequencies which actually occurred in 
the document, rather than on the mean of the distribution generating the topics. Estimating this 
fully exchangeable model with enough topics allows some topics to be used entirely to explain the 
response variables, and others to be used to explain the word occurrences. This degrades predictive 
performance, as demonstrated in Blei and Jordan (2003). Put a different way, here the latent vari- 
ables that govern the response are the same latent variables that governed the words. The model 
does not have the flexibility to infer topic mass that explains the response, without also using it to 
explain some of the observed words. 



3. Computation with supervised LDA 

We need to address three computational problems to analyze data with sLDA. First is posterior 
inference, computing the conditional distribution of the latent variables at the document level 
given its words wi-n and the corpus-wide model parameters. The posterior is thus a conditional 
distribution of topic proportions 6 and topic assignments zi-^jy. This distribution is not possible to 
compute. We approximate it with variational inference. 

Second is parameter estimation, estimating the Dirichlet parameters a, GLM parameters ry and S, 
and topic multinomials f3i:K from a data set of observed document-response pairs {wd,i:N,yd}d=i- 
We use maximum likelihood estimation based on variational expectation-maximization. 
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Fig 1. The graphical model representation of supervised latent Dirichlet allocation (sLDA). (Nodes are random vari- 
ables; edges indicate possible dependence; a shaded node is an observed variable; an unshaded node is a hidden variable.) 



The final problem is prediction, predicting a response y from a newly observed document wi-i^ and 
fixed values of the model parameters. This amounts to approximating the posterior expectation 
¥j[y\wi;N,a,Pi;K,-r],S\. 

We treat these problems in turn for the general GLM setting of sLDA, noting where GLM-specific 
quantities need to be computed or approximated. We then consider the special cases of a Gaussian 
response and a Poisson response, for which the algorithms have an exact form. Finally, we suggest 
a general-purpose approximation procedure for other response distributions. 

3.1 Posterior inference 

Both parameter estimation and prediction hinge on posterior inference. Given a document and 
response, the posterior distribution of the latent variables is 

(3) p{9, zi;N I wi:N,y, a, /3i;i^, r/, cr^) = 

p{e I a) (lln=l Pi^ri I 0)p{Wn I Zn, /3i:k)^ p{y \ Zi:N, 7], (T^) 
Jdep{9 I a) Y^^^,j^ {\{n=l Pi^n I 0)p{Wn \ Zn, /3l:i^ piy \ Zi;N,V, O"^) 

The normalizing value is the marginal probability of the observed data, i.e., the document wi:isf and 
response y. This normalizer is also known as the likelihood, or the evidence. As with LDA, it is not 
efficiently computable (Blei et al., 2003). Thus, we appeal to variational methods to approximate 
the posterior. 

Variational methods encompass a number of types of approximation for the normalizing value of 
the posterior. For reviews, see Wainwright and Jordan (2008) and Jordan et al. (1999). We use 
mean-field variational inference, where Jensen's inequality is used to lower bound the normalizing 
value. We let vr denote the set of model parameters, vr = {a, f3i-K,il,S} and q{6,zi:]\f) denote a 
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variational distribution of the latent variables. The lower bound is 



logp('U;i:Ar I Tt) = log I p{e , Zi-N , W^N \ T^) 



Zl:N 



log j y^^p{0,Zi;N,Wi:N I Tt) 



N 



q{9,Zi;N) 

> E[logp{e,zi;N,wi;N\Tr)] -E[logq{e,zi;N)], 

where all expectations are taken with respect to q{9, ziiAt). This bound is called the evidence lower 
bound (ELBO), which we denote by C{-). The first term is the expectation of the log of the joint 
probability of hidden and observed variables; the second term is the entropy of the variational 
distribution, H(g) = —E[logq{6, zi-j\f)]. In its expanded form, the sLDA ELBO is 

C{wi..N, y = E[logp(0 I a)] + En=i E[logp(Z„ I d)] 

(4) + En=i E[logpK I Zn, ^1:k)] + E[logp{y I Zi.,N,v, S)] + H(g) . 
This is a function of the observations {wi:N,y} and the variational distribution. 

In variational inference we first construct a parameterized family for the variational distribution, 
and then fit its parameters to tighten Equation 4 as much as possible for the given observations. 
The parameterization of the variational distribution governs the expense of optimization. 

Equation 4 is tight when q{6, zi:n) is the posterior, but specifying a family that contains the posterior 
leads to an intractable optimization problem. We thus specify a simpler family. Here, we choose the 
fully factorized distribution, 

(5) q{0, Zi;N I 7, h:N) = q{0 I 7) n^=l (l{Zn \ 

where 7 is a K-dimensional Dirichlet parameter vector and each (pn parametrizes a categorical 
distribution over K elements. The latent topic assignment Z„ is represented as a ii'-dimensional 
indicator vector; thus E[Z„] = q{zn) = (pn- Tightening the bound with respect to this family is 
equivalent to finding its member that is closest in KL divergence to the posterior (Wainwright and 
Jordan, 2008; Jordan et al., 1999). Thus, given a document-response pair, we maximize Equation 4 
with respect to 4>i;N and 7 to obtain an estimate of the posterior. 

Before turning to optimization, we describe how to compute the ELBO in Equation 4. The first 
three terms and the entropy of the variational distribution are identical to the corresponding terms 
in the ELBO for unsupervised LDA (Blei et al., 2003). The first three terms are 

K K 

(6) E[\ogp{e\a)] = \ogT{Y.f^^ai)-Y,\ogT{ai) + Y,{a,-l)E[\oge,] 

i=l i=l 

K 



(7) E[\ogp{Zn\e)] = Y,(pnM^ogei\ 



1=1 

K 

(8) E\[ogp{Wn\ Zn, Pi:k)] = ^ (/>n,i log A,-u;„ • 

i=l 

The entropy of the variational distribution is 

N K K K 

(9) H(g) = - E E "^"'^ - ^ (S«=i ^) + E log r(7i) -Y.{l^- l)E[log 9^] . 

n=l i=l i=l i=l 
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We note that the expectation of the log of the Dirichlet random variable is 

(10) E[loge,] = *(7.)-^(Ef=i7,^ 
where ^{x) denotes the digamma function, i.e., the first derivative of the log of the Gamma function. 

The variational objective function for sLDA differs from LDA in the fourth term, the expected log 
probability of the response variable given the latent topic assignments. This term is 

(11) E[logp(y I Zi.,N,v, S)] = log h{y, S) + ^ [t?^ (E [Z] y)-E [A{rj'^ Z)] 
We see that computing the sLDA ELBO hinges on two expectations. The first expectation is 

1 ^ 

(12) E[Z]=^=-^cPn, 

n=l 

which follows from Z„ being an indicator vector. 

The second expectation is E[^(77'''Z)]. This is computable in some models, such as when the response 
is Gaussian or Poisson distributed, but in the general case it must be approximated. We will address 
these settings in detail in Section 3.4. For now, we assume that this issue is resolvable and continue 
with the procedure for approximate posterior inference. 

We use block coordinate-ascent variational inference, maximizing Equation 4 with respect to each 
variational parameter vector in turn. The terms that involve the variational Dirichlet 7 are identical 
to those in unsupervised LDA, i.e., they do not involve the response variable y. Thus, the coordinate 
ascent update is as in Blei et al. (2003), 

(13) 7"^"^« + Ell'An. 



The central difference between sLDA and LDA lies in the update for the variational multinomial 
(pn- Given n G {1, . . . , A^}, the partial derivative of the ELBO with respect to (pn is 

(14) — = E[log0] + E[logpiwn I Pi..k)] -logcPn + l+ (^) ^ " (^J ^ {HMv'' Z)] } . 

Optimizing with respect to the variational multinomial depends on the form of {E[A(ry'''Z)] |. 
In some cases, such as a Gaussian or Poisson response, we obtain exact coordinate updates. In other 
cases, we require gradient based optimization methods. Again, we postpone these details to Section 
3.4. 

Variational inference proceeds by iteratively updating the variational parameters {7, c^i-.n} according 
to Equation 13 and the derivative in Equation 14. This finds a local optimum of the ELBO in 
Equation 4. The resulting variational distribution q is used as a proxy for the posterior. 



3.2 Parameter estimation 



The parameters of sLDA are the K topics Pi-.k, the Dirichlet hyperparameters a, the GLM coeffi- 
cients T], and the GLM dispersion parameter 5. We fit these parameters with variational expectation 
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maximization (EM), an approximate form of expectation maximization where the expectation is 
taken with respect to a variational distribution. As in the usual EM algorithm, maximization pro- 
ceeds by maximum likelihood estimation under expected sufficient statistics (Dempster et al., 1977). 

Our data are a corpus of document-response pairs V = {wd,i:N :yd}d=i- Variational EM is an op- 
timization of the corpus-level lower bound on the log likelihood of the data, which is the sum 
of per-document ELBOs. As we are now considering a collection of document-response pairs, in 
this section we add document indexes to the previous section's quantities, so response variable y 
becomes yd, empirical topic assignment frequencies Z becomes Zd-, and so on. Notice each docu- 
ment is endowed with its own variational distribution. Expectations are taken with respect to that 
document-specific variational distribution qd{zi:N ■, 0) , 

D 

(15) C{a, /3i:i^ , r/, (5; P) = ^ Ed[logp(6'd, Zd,i:N, Wd,i:N,yd)] + H(gd)- 

d=i 

In the expectation step (E-step), we estimate the approximate posterior distribution for each 
document-response pair using the variational inference algorithm described above. In the maxi- 
mization step (M-step), we maximize the corpus- level ELBO with respect to the model parameters. 
Variational EM finds a local optimum of Equation 15 by iterating between these steps. The M-step 
updates are described below. 



Estimating the topics. The M-step updates of the topics I3i:K are the same as for unsupervised 
LDA, where the probability of a word under a topic is proportional to the expected number of times 
that it was assigned to that topic (Blei et al., 2003), 

D N 

(16) J]J]l(t.,,„ = ^/;)<^^,„. 

d=l n=l 

Proportionality means that each (3^^^ is normalized to sum to one. We note that in a fully Bayesian 
sLDA model, one can place a symmetric Dirichlet prior on the topics and use variational Bayes to 
approximate their posterior. This adds no additional complexity to the algorithm (Bishop, 2006). 



Estimating the GLM parameters. The GLM parameters are the coefficients t] and dispersion 
parameter 5. For the corpus- level ELBO, the gradient with respect to GLM coefficients r] is 

(17) I - |Q)|;Ke[z.],.-e[^(,-z.)]} 

(18) = (^)|X]<^,2/rf-X]E4Mr?^Z,)Z,]l, 

^ ^ [d=i d=i ) 

where /i(-) = Eglm[^ | ■]• The appearance of this expectation follows from properties of the cumulant 
generating function A[r]^ z) (Brown, 1986). This GLM mean response is a known function of rj^ Zd 
in all standard cases. However, Fj[iJ,{r]~^ Zd)Zd] has an exact solution only in some cases, such as 
Gaussian or Poisson response. In other cases, we approximate the expectation. (See Section 3.4.) 
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The derivative with respect to 5, evaluated at fj^ew, is 

(19) {e'-^tS^] + {}) [t Kew (E [Z4 y,) -E[Aif,l^Z,)] 



Equation 19 can be computed given that the rightmost summation has been evaluated, exactly or 
approximately, while optimizing the coefficients r]. Depending on h{y, 6) and its partial derivative 
with respect to 6, we obtain Jnew either in closed form or via one-dimensional numerical optimization. 



Estimating the Dirichlet parameter. While we fix the Dirichlet parameter a to 1/K in Section 
4, other applications of topic modeling fit this parameter (Blei et al., 2003; Wallach et al., 2009). 
Estimating the Dirichlet parameters follows the standard procedure for estimating a Dirichlet dis- 
tribution (Ronning, 1989). In a fully-observed setting, the sufficient statistics of the Dirichlet are 
the logs of the observed simplex vectors. Here, they are the expected logs of the topic proportions, 
see Equation 10. 



3.3 Prediction 

Our focus in applying sLDA is prediction. Given a new document wi:N and a fitted model {a, f3i-K, rj-, <5}, 
we want to compute the expected response value, 

(20) E[Y\wi;N,a,l3i;K,ri,a'^] = E[^(7?^Z) \wi;N,a, jii-.x]- 

To perform prediction, we approximate the posterior mean of Z using variational inference. This is 
the same procedure as in Section 3.1, but here the terms depending on the response y are removed 
from the ELBO in Equation 4. Thus, with the variational distribution taking the same form as 
Equation 5, we implement the following coordinate ascent updates for the variational parameters, 

(21) 7^^^^ = a + Eli'/'n 

(22) 4>l''" oc exp{E,[log0]+log/3^J, 

where the log of a vector is a vector of the log of each of its components, is the vector of 
p{wn I Pk) for each k G {1, . . . , X}, and again proportionality means that the vector is normalized to 
sum to one. Note in this section we distinguish between expectations taken with respect to the model 
and those taken with respect to the variational distribution. (In other sections all expectations are 
taken with respect to the variational distribution.) 

This coordinate ascent algorithm is identical to variational inference for unsupervised LDA: since 
we averaged the response variable out of the right-hand side in Equation 20, what remains is the 
standard unsupervised LDA model for Zi-;^ and 9 (Blei et al., 2003). Notice this algorithm does 
not depend on the particular response type. 

Thus, given a new document, we first compute q{6,zi:]\f), the variational posterior distribution of 
the latent variables 6 and Z„. We then estimate the response with 

(23) E[y I wi.,N, a, Pi:K, V, ~ Eg [m(?7^Z)] 

As with parameter estimation, this depends on being able to compute or approximate Eq[/i(r/^Z)]. 
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3.4 Examples 

The previous three sections have outlined the general computational strategy for sLDA: a procedure 
for approximating the posterior distribution of the topic proportions 9 and topic assignments Zi-i\f 
on a per-document/response basis, a maximum likelihood procedure for fitting the topics f3i-K and 
GLM parameters {ry, 6} using variational EM, and a procedure for predicting new responses from 
observed documents using an approximate expectation based on a variational posterior. 

Our derivation has remained free of any specific assumptions about the form of the response distri- 
bution. We now focus on sLDA for specific response distributions — the Gaussian and Poisson — and 
suggest a general approximation technique for handling other responses within the GLM framework. 
The response distribution blocked our derivation at three points: the computation of E[A(r/^Z)] in 
the per-document ELBO of Equation 4, the computation of {E[A(r/'''Z)] } in Equation 14 and 

corresponding update for the variational multinomial (pm and the computation oiFjq[fi{r]~^ 2^)2^] for 
fitting the GLM parameters in a variational EM algorithm. We note that other aspects of working 
with sLDA are general to any type of exponential family response. 



Gaussian response When the response is Gaussian, the dispersed exponential family form can 
be seen as follows: 

1 f {y-v'^z)'^\ 



(24) 



(25) 



p{y\z,r],d) 



V2^ 
1 



exp 



exp 



26 j ' 

-y^/2 + yrj^z — {r]^zz^ 7])/2 ^ 
6 



Here the natural parameter r/^ z and mean parameter are identical. Thus, ^{rf^ z) = r]^ z. The usual 
variance parameter o"^ is equal to 6. Moreover, notice that 



(26) 
(27) 



A{r]^z) 



1 



V2^ 

{r]^zz^ri)/2 



exp{-yV2} 



We first specify the variational inference algorithm. This requires computing the expectation of the 
log normalizer in Equation 27, which hinges on E \^22~^^^ , 

1 



(28) 
(29) 



1 



+ En=idiag{ (/>„}) . 



To see Equation 29, notice that for m ^ n, E[Z„Z^] = E[Z„]E[Zm]^ = </>n0m because the variational 
distribution is fully factorized. On the other hand, E[Z„Zj] = diag(E[Z„]) = diag(^n) because Z„ 
is an indicator vector. 

To take the derivative of the expected log normalizer, we consider it as a function of a single 
variational parameter (pj. Define (p-j := 'Yln^j ^n- Expressed as a function of E^A^r]^ 2)'j is 

1 



(30) 
(31) 



2iV2 
1 

2iV2 



7] 



+ 



+ diag{(j)j} 



2{ri'^(l)_j)7j^(Pj + {rjo7])'^(Pj 



rj + const 
+ const . 
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Thus the gradient is 

(32) ^EiAir^-^Z)] = ^ [2{r^'^ + {rj o r^) 

Substituting this gradient into Equation 14 yields an exact coordinate update for the variational 
multinomial 

(33) </)f ™ oc exp|E[loge] + E[logp(u;, | /3i..k)] + (^) V " ^ [Hv'^^~j)v + iv o v)]^ 

Exponentiating a vector means forming the vector of exponentials. The proportionality symbol 
means the components of (/>"®™ are computed according to Equation 33, then normalized to sum to 
one. Note that E[log0i] is given in Equation 10. 

Examining this update in the Gaussian setting uncovers further intuitions about the difference be- 
tween sLDA and LDA. As in LDA, the jth word's variational distribution over topics depends on 
the word's topic probabilities under the actual model (determined by /3i:i<-). But Wj^s variational 
distribution, and those of all other words, affect the probability of the response. To see this, con- 
sider the expectation of logp(y | z,r],5) of Equation 24, which is is a term in the document-level 
ELBO of Equation 4. Notice that variational distribution q{zn) plays a role in the expected residual 
sum of squares E[(y — r]^ Z)"^]. The end result is that the update Equation 33 also encourages the 
corresponding variational parameter to decrease this expected residual sum of squares. 

Further, the update in Equation 33 depends on the variational parameters of all other words. 
Unlike LDA, the cannot be updated in parallel. Distinct occurrences of the same term must be 
treated separately. 

We now turn to parameter estimation for the Gaussian response sLDA model, i.e., the M-step 
updates for the parameters rj and 5. Define y := as the vector of response values across 

documents and let X be the D x K design matrix, whose rows are the vectors Z^. Setting to zero 
the r] gradient of the corpus-level ELBO from Equation 18, we arrive at an expected-value version 
of the normal equations: 

(34) E[X^X]7] = E[X]'^y f,^,^ ^ (e[X^ X])'^ E[X]'^ y . 



Here the expectation is over the matrix X, using the variational distribution parameters chosen in 
the previous E-step. Note that the dth row of E[X] is just E[Z(i] from Equation 12. Also, 

(35) E[X^X]=^E[Z,Zj], 

d 

with each term having a fixed value from the previous E-step as well, given by Equation 29. 
We now apply the first-order condition for 5 of Equation 19. The partial derivative needed is 

dh{yd,6)/d6 1 



(36) 



h{yd,5) 26' 



which can be seen from Equation 27. Using this derivative and the definition of r/new in Equation 
19, we obtain 

(37) Jnew ^ ^ jy^y- 2/^E[X] (e[X^ X]y' E[X]'^ y^ . 
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It is tempting to try a further simplification of Equation 37: in an ordinary least squares (OLS) 
regression of y on the columns of E[X], the analog of Equation 37 would just equal 1/D times the 
sum of the squared residuals {y — E[X]f/ncw)- However, that identity does not hold here, because 
the inverted matrix in Equation 37 is E[X~'^X], rather than E[X]'''E[X]. This illustrates that the rj 
update Equation 34 is not just OLS regression of y on E[X]. 

Finally, we form predictions just as in Section 3.3. Since the mapping from the natural parameter 
to the mean parameter is the identity function, 

(38) E[Y\wi:N,a,/3i..K,r],6]^r]'^E[Z]. 

Again, the expectation is taken with respect to variational inference in the unsupervised setting 
(Equation 21 and Equation 22). 



Poisson response An overdispersed Poisson response provides a natural generalized linear model 
formulation for count data. Given mean parameter A, the overdispersed Poisson has density 

(39) piy\X,5) = -\y/'exp{-X/S}. 
This can be put in the overdispersed exponential family form 

(40) p[y I A) = — exp ' 



In the GLM, the natural parameter log A = rj'^z, h{y,6) = 1/yl, and 
(41) A{r]^z) = niv^z) = exp{r/^z} . 



We first compute the expectation of the log normalizer 



(42) 
(43) 

where 

(44) 
(45) 



E 



E[exp{(l/Ar)Et,^T^^} 

N 

Y[E[exp{{l/N)ri'^Zn}], 



n=l 



E[exp{(l/iV)r/Tz„}] = J2i=i ^n,iew{r]i/N} + (l - cPn,^) 



K -l + Etl(^n,^exp{l^i/N} . 



The product in Equation 43 also provides E[fi{r]^ Z)], which is needed for prediction in Equation 
20. Denote this product by C and let C_„ denote the same product, but over all indices except for 
the nth. The derivative of the expected log normalizer needed to update (pn is 



(46) 



d 



E[A{rj'^Z)] =C_„exp{?7/7V} . 



As for the Gaussian response, this permits an exact update of the variational multinomial. 
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In the derivative of the corpus- level ELBO with respect to the coefficients ij (Equation 18), we need 
to compute the expected mean parameter times Z, 

1 ^ 

(47) nf4r]''Z)Z] = -^E[^(r/Tz)Z„] 



iV 

n=l 



(48) = ?5!i?Mj;c 



N 



n=l 



We turn to the M-step. We cannot find an exact M-step update for the GLM coefficients, but we can 
compute the gradient for use in a convex optimization algorithm. The gradient of the corpus-level 
ELBO is 

\ d d n / 

The derivative of h{y, 6) with respect to 5 is zero. Thus the dispersion parameter M-step is exact, 

f , T,driIew^[Zd]yd 



(50) 



Ed^MvIe^Z,) 



As for the general case, Poisson sLDA prediction requires computing E[^(?7'''Z)] . Since /u(-) and 
A{-) are identical, this is given in Equation 43. 



Exponential family response via the delta method With a general exponential family re- 
sponse one can use the multivariate delta method for moments to approximate difficult expecta- 
tions (Bickel and Doksum, 2007), a method which is effective in variational approximations (Braun 
and McAuliffe, 2010). In other work, Chang and Blei (2010) use this method with logistic regression 
to adapt sLDA to network analysis; Wang et al. (2009) use this method with multinomial regression 
for image classification. With the multivariate delta method, one can embed any generalized linear 
model into sLDA. 



4. Empirical study 

We studied sLDA on two prediction problems. First, we consider the "sentiment analysis" of news- 
paper movie reviews. We use the publicly available data introduced in Pang and Lee (2005), which 
contains movie reviews paired with the number of stars given. While Pang and Lee (2005) treat this 
as a classification problem, we treat it as a regression problem. 

Analyzing document data requires choosing an appropriate vocabulary on which to estimate the 
topics. In topic modeling research, one typically removes very common words, which cannot help 
discriminate between documents, and very rare words, which are unlikely to be of predictive impor- 
tance. We select the vocabulary by removing words that occur in more than 25% of the documents 
and words that occur in fewer than 5 documents. This yielded a vocabulary of 2180 words, a corpus 
of 5006 documents, and 908,000 observations. For more on selecting a vocabulary in topic models, 
see Blei and Lafferty (2009). 
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world . powerful . life . experience . tale . own . complex . nature . hiuman - 

award . opinions . academy . unbearable . expressed . mine . reflect . rated . totally - 

delightful . fine . profanity . cfiarming . romantic . woody . comedy . alien . funniest - 

theres . isnt . tfiats . entertainment . comic . motion . doesnt . screenplay . plot- 

employers . reflect . expressed . opinions . unbearable . mine . waste . totally . meant - 

Jeffrey . son . age . kids . disney . liked . ages . animated . slapstick - 

rated . acceptable . teenagers . language . send . letter . sexuality . movie . subscribe - 

reply . subscribe . details . mpaa . subject . running . message . rating . minutes - 

screenplay . emotional . theres . drama . characters . melodrama . despite . doesnt . unfortunately - • 

movie . story . films . time . director . characters . little . picture . dont - • 

didnt . cant . story . couldnt . wasnt . seeing . films . bad . dialogue- • 

waste . unbearable . opinions . employers . expressed . totally . reflect . poor . mine-» 

-2' -1' 

Topic and Coefficient 

Fig 2. A 12-topic sLDA model fit to the movie review data. 

Second, we studied the texts of amendments from the 109th and 110th Senates. Here the response 
variables are the discrimination parameters from an ideal point analysis of the votes (Clinton et al., 
2004). Ideal point analysis of voting data is used in quantitative political science to map senators to 
a real-valued point on a political spectrum. Ideal point models posit that the votes of each senator 
j are summarized with a real- valued latent variable Xj. The senator's vote on an issue i, which is 
the binary variable yij, is determined by a probit model, 

Uij ~ Probit(xj/3i + a,). 

Thus, each issue is connected to two parameters. For our analysis, we are most interested in the 
"issue discrimination" When /3j and Xj have the same sign, the senator j is more likely to vote 
in favor of the issue i. Just as Xj can be interpreted as a senator's point on the political spectrum, 
/3j can be interpreted as an issue's place on that spectrum. (The intercept term a, is called the 
"difficulty" and allows for bias in the votes, regardless of the ideal points of the senators.) 

We connected the results of an ideal point analysis to texts about the votes. In particular, we study 
the amendments considered by the 109th and 110th U.S. Senates.^ As a preprocessing step, we 
estimated the discrimination parameters fii for each amendment, based on the voting record. Each 
discrimination /3j is used as the response variable for its amendment text. We study the question: 
How well we can predict the political tone of an amendment based only on its text? 

^We collected our data — the amendment texts and the voting records — from the open government web-site 
www.govtrack.com. 



SUPERVISED TOPIC MODELS 



15 



debtor . tax . petition . debt . percent . age . iraq . rate . code - 



chapter . commission . violation . court . procedures . persons . deatli . evidence . members - 



secretary_of_defense . acquisition . airjorce . armedjorces . member . procurement . defense . publicjaw . army - 



energy . transportation . administrator . technologies . project . board . technology . management . research - 



alien . employer . employment . immigration_and . nationality . aliens . fee . insert . heading - 



equipment . states . grant . project . grants . resources . response . local . local_government - 



-1.0 



-0.5 



0.0 



0.5 



Topic and Coefficient 



Fig 3. A 5-topic sLDA model fit to the 109th U.S. Senate data. 



We pre-processed the Senate text in the same way as the reviews data. To obtain the response 
variables, we used the implementation of ideal point analysis in Simon Jackman's Political Science 
Computational Laboratory R package. For the 109th Senate, this yielded 288 amendments, a vocab- 
ulary of 2084 words, and 62,000 observations. For the 110th Senate, this yielded 213 amendments, 
a vocabulary of 1653 words, and 63,000 observations. 

For both reviews and senate amendments, we transformed the response to approximate normality 
by taking logs. This makes the data amenable to the continuous-response model of Section 2; for 
these two problems, generalized linear modeling turned out to be unnecessary. We initialized fii-K 
to randomly perturbed uniform topics, cj^ to the sample variance of the response, and r/ to a grid 
on [—1, 1] in increments oiljK. We ran EM until the relative change in the corpus- level likelihood 
bound was less than 0.01%. In the E-step, we ran coordinate-ascent variational inference for each 
document until the relative change in the per-document ELBO was less than 0.01%. 

We can examine the patterns of words found by models fit to these data. In Figure 2 we illustrate 
a 12-topic sLDA model fit to the movie review data. Each topic is plotted as a list of its most likely 
words, and each is attached to its estimated coefficient in the linear model. Words like "powerful" 
and "complex" are in a topic with a high positive coefficient; words like "waste" and "unbearable" 
are in a topic with high negative coefficient. In Figure 3 and Figure 4 we illustrate sLDA models fit 
to the Senate amendment data. These models are harder to interpret than the models fit to movie 
review data, as the response variable is likely less governed by the text. (Much more goes into a 
Senator's decision of how to vote.) Nonetheless, some patterns are worth noting. The health care 
amendments in the 110th Senate were a distinctly right wing issue; grants and immigration in the 
109th Senate were a left wing issue. 

We assessed the quality of the predictions using five fold cross-validation. We measured error in 
two ways. First, we measured correlation between the out-of-fold predictions with the out-of-fold 
response variables. Second, we measured "predictive R^." We defined this quantity as the fraction 
of variability in the out-of-fold response values which is captured by the out-of-fold predictions: 



We compared sLDA to linear regression on the 0^ from unsupervised LDA. This is the regression 
equivalent of using LDA topics as classification features (Blei et al., 2003; Fei-Fei and Perona, 2005). 
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We studied the quality of these predictions across different numbers of topics. Figures 5, 6 and 7 
illustrate that sLDA provides improved predictions on all data. The movie review rating is easier 
to predict that the U.S. Senates discrimination parameters. However, our study shows that there is 
predictive power in the texts of the amendments. 

Finally, we compared sLDA to the lasso, which is Li -regularized least-squares regression. The lasso 
is a widely used prediction method for high-dimensional problems (Tibshirani, 1996). We used each 
document's empirical distribution over words as its lasso covariates. We report the highest pR^ it 
achieved across different settings of the complexity parameter, and compare to the highest pR^ 
attained by sLDA across different numbers of topics. For the reviewer data, the best lasso pR^ was 
0.426 versus 0.432 for sLDA, a modest 2% improvement. On the U.S. Senate data, sLDA provided 
definitively better predictions. For the 109th U.S. Senate data the best lasso pR^ was 0.15 versus 
0.27 for sLDA, an 80% improvement. For the 110th U.S. Senate data, the best lasso pR^ was 0.16 
versus 0.23 for sLDA, a 43% improvement. Note moreover that the Lasso provides only a prediction 
rule, whereas sLDA models latent structure useful for other purposes. 

5. Discussion 

We have developed sLDA, a statistical model of labelled documents. The model accommodates the 
different types of response variable commonly encountered in practice. We presented a variational 
procedure for approximate posterior inference, which we then incorporated in an EM algorithm 
for maximum-likelihood parameter estimation. We studied the model's predictive performance, our 
main focus, on two real- world problems. In both cases, we found that sLDA improved on two natural 
competitors: unsupervised LDA analysis followed by linear regression, and the lasso. These results 
illustrate the benefits of supervised dimension reduction when prediction is the ultimate goal. 

We close with remarks on future directions. First, a "semi-supervised" version of sLDA — where 
some documents have a response and others do not — is straightforward. Simply omit the last two 
terms in Equation 14 for unlabelled documents and include only labelled documents in Equation 
18 and Equation 19. Since partially labelled corpora are the rule rather than the exception, this 
is a valuable avenue. (Though note, in this setting, that care must be taken that the response 
data exert sufficient influence on the fit.) Second, if we observe an additional fixed-dimensional 
covariate vector with each document, we can include it in the analysis just by adding it to the 
linear predictor. This change will generally require us to add an intercept term as well. Third, the 
technique we have used to incorporate a response can be applied in existing variants of LDA, such 
as author-topic models (Rosen-Zvi et al., 2004), population-genetics models (Pritchard et al., 2000), 
and survey-data models (Erosheva, 2002). We have already mentioned that sLDA has been adapted 
to network models (Chang and Blei, 2010) and image models (Wang et al., 2009). Finally, we are 
now studying the maximization of conditional likelihood rather than joint likelihood, as well as 
Bayesian nonparametric methods to explicitly incorporate uncertainty about the number of topics. 
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Fig 4. A 15-topic sLDA model fit to the 110th U.S. Senate data. 
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Fig 6. Error between out-of-fold predictions and observed response on the 109th U.S. Senate data. 
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Fig 7. Error between out-of-fold predictions and observed response on the 110th U.S. Senate data. 



